Food detection

Objective:

This notebook aims to detect the presence of food in images! In addition, Grad-CAM is used to visualize which features of the image matter most when the model classifies it.

Methodology:

Five approaches were used for training and evaluation:

Results

Training accuracy | Test accuracy:

  1. DNN: 92% | 69%
  2. VGG16 + fully-connected: 98% | 90%
  3. VGG16 + SVM: 98% | 89%
  4. New CNN: 100% | 84%
  5. Bootstrap DTC: n/a | 74%

Grad-CAM:

Grad-CAM was applied to the convolutional models (2 and 4) and revealed some interesting outcomes.

Introduction

Image classification has long been studied and is an important field of machine learning and artificial intelligence. Normally associated with complex models, detecting and correctly classifying an image is not as trivial for a computer as it is for us humans. For that reason, several models exist nowadays, and deep neural networks gained their space with the advancements of microprocessors, overcoming time and memory constraints.

Despite the success of convolutional neural networks at deciphering the subtle meanings of an image, simple and light classification models (e.g. SVM, logistic regression) have always caught attention for their interpretability, and they are still explored to model high-dimensional problems (such as image classification).

But the problem scope goes beyond how complex a model is. For instance, Joutou et al. [2] proposed an SVM classifier for this purpose and, despite the "simplicity" of the model, the data acquired to predict spherical fruits involved laser scanning to obtain reflectance and range precision, yielding the fruit's shape and color.

In particular, recognizing the presence of food items in images is also a challenge, and Jimenez et al. proposed one of the first methods back in 1999 [1].

Despite the difficulty of classifying food items, as they are strongly related to color and shape [3], novel methods have been tested, and combinations of multiple CNN models can already predict Mediterranean Diet food items with an accuracy of 52.71% [4].

[1] Joutou, T., & Yanai, K. (2009). A food image recognition system with multiple kernel learning. In IEEE International Conference on Image Processing (pp. 285–288).

[2] Farinella, G. M., Allegra, D., Stanco, F., & Battiato, S. (2015, September). On the exploitation of one class classification to distinguish food vs non-food images.

[3] Farinella, G. M., Allegra, D., & Stanco, F. (2014, September). A benchmark dataset to study the representation of food images. In European Conference on Computer Vision (pp. 584–599). Springer, Cham.

[4] Papathanail, I., Lu, Y., Vasiloglou, M., Stathopoulou, T., Ghosh, A., Faeh, D., & Mougiakakou, S. (2021, March). Food recognition in assessing the Mediterranean diet: a hierarchical approach (unpublished). In 14th International Conference on Advanced Technologies & Treatments for Diabetes.

Transfer Learning


References: pretrained models with keras

transfer with CNN

TensorFlow guide

Data processing

Although the images for training and testing our models all come from the same dataset (TRAIN and TEST folders), we prepared three types of processed data in order to train the different models.

Regardless of the model, due to computational limits, the images had to be resized to 1/4 of their original size (from 240x320 to 60x80).

Furthermore, for training the new CNN, the images also had to be converted to grayscale. Although this is still a topic of discussion in the scientific community, the accuracies of the two approaches (RGB and grayscale) do not differ much [5][6] (check the image below).

  1. data_inet: for the VGG16 model; input: (samples, 60, 80, 3)
  2. data: for our own architecture; input: (samples, 60, 80, 1)
  3. data_stack: for the simple DNN; input: (samples, 4800)
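As a minimal sketch of this pipeline (assuming 240x320 uint8 RGB arrays; the actual processing lives in getalldata.py, which may use a library resize instead of the block average shown here), the three formats can be produced with NumPy alone:

```python
import numpy as np

def preprocess(images_rgb):
    """Build the three data formats from (N, 240, 320, 3) uint8 RGB images."""
    x = images_rgb.astype(np.float32) / 255.0          # normalize to [0, 1]
    n, h, w, c = x.shape                               # (N, 240, 320, 3)
    # 4x4 block average -> (N, 60, 80, 3), for the VGG16 input (data_inet)
    data_inet = x.reshape(n, h // 4, 4, w // 4, 4, c).mean(axis=(2, 4))
    # luminance-weighted grayscale -> (N, 60, 80, 1), for the own CNN (data)
    gray = data_inet @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    data = gray[..., np.newaxis]
    # unrolled grayscale -> (N, 4800), for the simple DNN (data_stack)
    data_stack = data.reshape(n, -1)
    return data_inet, data, data_stack
```

Note that 60 x 80 = 4800, which is where the flattened input size of data_stack comes from.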

For more information about the differences between training CNNs with grayscale and with RGB images:

[5] Convolutional neural network for human micro-Doppler classification

[6] Color-to-Grayscale: Does the Method Matter in Image Recognition?

(Figure: loss and accuracy curves for RGB vs. grayscale images)

To save time, the data were processed once beforehand (refer to the getalldata.py script) and saved for later use.

Code for Grad-Cam

The code for the Grad-CAM used can be found on the script gradcam.py

References:

  1. https://medium.com/@daniel.reiff2/understand-your-algorithm-with-grad-cam-d3b62fce353

  2. https://medium.com/@stepanulyanin/implementing-grad-cam-in-pytorch-ea0937c31e82

  3. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra https://arxiv.org/abs/1610.02391
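The full implementation lives in gradcam.py; as a sketch, the core weighting step of Grad-CAM [3] can be written in plain NumPy, given the conv-layer activations and the gradients of the class score with respect to them (in practice both would come from the framework's autodiff, e.g. a tf.GradientTape):

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Core Grad-CAM step from Selvaraju et al. [3].

    activations: (H, W, K) feature maps of the chosen conv layer
    gradients:   (H, W, K) d(class score)/d(activations), from autodiff
    returns:     (H, W) heatmap scaled to [0, 1]
    """
    # alpha_k: global-average-pool the gradients over the spatial dims
    weights = gradients.mean(axis=(0, 1))                # (K,)
    # weighted sum of the feature maps, then ReLU
    cam = np.maximum(activations @ weights, 0.0)         # (H, W)
    # normalize for visualization (guard against an all-zero map)
    return cam / cam.max() if cam.max() > 0 else cam
```

The resulting heatmap is then upsampled to the input image size and overlaid on it.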

Function to make single predictions while visualizing the image.

Two cases were taken into account, depending on the model doing the predicting:

While the new CNN model could predict directly from the image input, the pre-trained models first had to pass the image through the base convolutional layers before predicting.
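This dispatch can be sketched as follows, where predict_fn and feature_fn are hypothetical stand-ins for model.predict and the pre-trained base's feature extractor (the matplotlib visualization is omitted):

```python
import numpy as np

def predict_single(image, predict_fn, feature_fn=None):
    """Predict one image, handling both model families.

    feature_fn=None     -> the new CNN, predicting straight from the image
    feature_fn=vgg_base -> pre-trained model: extract conv features first
    """
    batch = image[np.newaxis, ...]        # add the batch dimension
    if feature_fn is not None:
        batch = feature_fn(batch)         # pass through the conv base
    return predict_fn(batch)[0]           # prediction for this one image
```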

1st approach: Simple DNN

If CNNs are largely used to perform image classification tasks today, DNNs were the basis of the first models learning to recognize images. For the sake of curiosity, a simple (really simple) DNN is built here.

It is composed of no more than dense layers!

Parameters

batch-size = 32

ReduceLROnPlateau patience = 50

optimizer = Adam()

epochs = 500

Tuning the parameters

In the first attempts, the training accuracy increased up to 0.8, then suddenly fell to a steady value of 0.5 until the end of training.

Taking into account the problems of vanishing gradients and local optima, a possible solution for such behavior is a variable learning rate that reacts to an accuracy value left unchanged over a certain period of training.

For that reason, a ReduceLROnPlateau callback was used to change the learning rate on a schedule whenever training reaches a plateau.

This solved the problem, with the learning rate multiplied by a factor of 0.001 after every 50 epochs showing no accuracy improvement.
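The behavior of that callback can be sketched with a toy scheduler (a simplified stand-in for Keras's ReduceLROnPlateau, not its actual implementation; the factor and patience values here are illustrative):

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` after `patience`
    epochs without improvement of the monitored accuracy."""

    def __init__(self, lr=1e-3, factor=0.001, patience=50):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.wait = float("-inf"), 0

    def on_epoch_end(self, accuracy):
        if accuracy > self.best:            # improvement: reset the counter
            self.best, self.wait = accuracy, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:  # plateau reached: shrink the lr
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```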

2nd approach: pre-trained VGG16 + training fully-connected layer

Transfer Learning

"Standing on the shoulder of giants"

Although all the pre-trained models for image classification should achieve good accuracy on most classification tasks, VGG16 was prioritized because its training dataset (ImageNet) contains food images [7].

References:

[7] https://arxiv.org/pdf/2004.03357.pdf


For this model, we'll use the data in the 3-channel format (60, 80, 3), as VGG16 was trained with 3-channel images.

Extract the features for both TRAINING and VALIDATION

Train the fully connected layers

Visualize predictions

3rd approach: VGG16 + SVM on top
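The idea of this approach (an SVM head on frozen conv features) can be sketched with scikit-learn, using random arrays as hypothetical stand-ins for the features extracted by the VGG16 base; the real features would be flattened to (samples, features) in the same way:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-ins for flattened VGG16 conv features of the TRAIN
# split: two separable blobs playing the roles of "food" and "non-food".
rng = np.random.default_rng(0)
feats_food = rng.normal(loc=1.0, size=(50, 128))
feats_nonfood = rng.normal(loc=-1.0, size=(50, 128))
X = np.vstack([feats_food, feats_nonfood])
y = np.array([1] * 50 + [0] * 50)

# The SVM plays the role of the fully-connected head
clf = SVC(kernel="rbf").fit(X, y)
acc = clf.score(X, y)   # training accuracy on the toy features
```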

Visualize predictions

Visualize Grad-CAM for the pre-trained VGG16 model

Grad-CAM for all conv layers in the model

Our own CNN architecture in Keras

Visualize Grad-CAM

Visualize Grad-CAM for multiple conv layers

Interestingly, we can see that the first conv layer, conv2d_17, has a more defined gradient output (a more homogeneous gradient of color), whereas the subsequent layers seem to learn other patterns beyond the food shape (some activation importance comes from the plate as well!).

All confusion matrices

Comparison for the 2 CNN models

New CNN attempts

Due to time and computational constraints, I'll perform several new CNN attempts with only 20 epochs, changing some parameters such as:

From this first round of attempts, a final CNN version will be trained, with more layers added and 50 epochs of training with data augmentation.

Taking the VGG16 architecture as a reference (where the kernel size stays at (3,3)):

The current model adds:

  1. Dropout layers
  2. kernel_regularizer
  3. try also with kernel_init = "he_normal"
  4. augment data

New model:

  1. Dropout layers
  2. kernel_reg
  3. MaxPooling AND BatchNormalization
  4. kernel_size = 5x5 and 7x7 at some point to get bigger features
  5. augment data

Let's now check some graphs of the val_loss and val_accuracy for this round of trained models.

The final one

Comparing all CNN models with one image

Run models 10 & 11 for 40 epochs:

10: no data augmentation; regularization: Dropout

11: data augmentation; regularization: none

Run model 11 again without regularization

BONUS: Bootstrap sampling with Decision Tree Classifier

What if we could drastically reduce the input variables and still have a reasonable model to predict?

While DNNs are so commonly used when the task is to detect and classify images, other approaches exist and are sometimes worth checking!

For the sake of curiosity, let's try bootstrap sampling with 5 samples, only 150 input variables (via PCA), and half of the training data!

As we are going to train a Decision Tree Classifier, we'll use the data_stack data, where each image is unrolled into one dimension.
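The bootstrap-and-vote scheme can be sketched as follows, with random arrays as hypothetical stand-ins for the PCA-reduced data_stack and a toy labeling rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 150))      # stand-in for the 150 PCA components
y = (X[:, 0] > 0).astype(int)        # toy food / non-food labels

# 5 bootstrap samples: draw with replacement, train one tree on each
trees = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))  # indices with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# aggregate the 5 trees by majority vote
votes = np.stack([t.predict(X) for t in trees])   # (5, samples)
pred = (votes.mean(axis=0) >= 0.5).astype(int)
```

Resampling with replacement means each tree sees a slightly different view of the data, and the vote smooths out individual trees' overfitting.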

Reducing the dimension using PCA.

Let's check how much information we conserve using 150 variables (a drastic cut, leaving out about 97% of the original dimensions).

With 150 components, our data retains around 81% of the variance!

Now we transform our original data to keep only these 150 components:
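The variance check and the projection can be sketched with plain NumPy (sklearn's PCA encapsulates the same computation; the shapes and numbers below come from random stand-in data, not the real images):

```python
import numpy as np

def pca_reduce(X, n_components=150):
    """Project X (samples, features) onto its first principal components
    and report the fraction of variance they retain."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S**2) / (S**2).sum()            # variance ratio per component
    retained = explained[:n_components].sum()
    X_reduced = Xc @ Vt[:n_components].T         # (samples, n_components)
    return X_reduced, retained
```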

XAI: SHAP values

Another interesting approach to explaining an ML model comes with SHAP values.


references:

https://shap-lrjball.readthedocs.io/en/latest/example_notebooks/deep_explainer/Front%20Page%20DeepExplainer%20MNIST%20Example.html

threads reporting the issues:

https://github.com/tensorflow/probability/issues/540

https://github.com/slundberg/shap/issues/2189

https://github.com/slundberg/shap/pull/2355